Totally Model-Free Reinforcement Learning by Actor-Critic Elman Networks in Non-Markovian Domains

Authors

  • Eiji Mizutani
  • Stuart E. Dreyfus

Abstract

In this paper we describe how an actor-critic reinforcement-learning agent in a non-Markovian domain finds an optimal sequence of actions in a totally model-free fashion; that is, the agent learns neither transition probabilities and associated rewards, nor by how much the state space should be augmented so that the Markov property holds. In particular, we employ an Elman-type recurrent neural network to solve non-Markovian problems, since an Elman-type network is able to render the process Markovian implicitly and automatically. A standard actor-critic neural-network model has two separate components: the action (actor) network and the value (critic) network. In animal brains, however, those two are presumably not distinct but rather somehow entwined. We thus construct one Elman network with two output nodes, an actor node and a critic node, where a portion of the shared hidden layer is fed back as the context layer, which functions as a history memory to produce sensitivity to non-Markovian dependencies. The agent explores small-scale three- and four-stage triangular path networks to learn an optimal sequence of actions that maximizes the total value (or reward) associated with its transitions from vertex to vertex. The posed problem has a deterministic transition and reward associated with each allowable action (although either could be stochastic) and is rendered non-Markovian by the reward being dependent on an earlier transition. Due to the nature of neural model-free learning, the agent needs many iterations to find the optimal actions, even in small-scale path problems.
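To make the architecture concrete, here is a minimal sketch of the shared-network design described above: one Elman network whose hidden layer drives both an actor output and a critic output, with the hidden activations copied back as the context layer. The layer sizes, tanh activation, and softmax actor head are illustrative assumptions, and the actor-critic learning updates are omitted; this is not the authors' exact implementation.

    import numpy as np

    class ElmanActorCritic:
        """Sketch of a single Elman network with actor and critic output nodes.

        The shared hidden layer is copied into a context layer and fed back
        as extra input on the next step, serving as a history memory that
        restores sensitivity to non-Markovian dependencies.
        """

        def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
            rng = np.random.default_rng(seed)
            # Hidden layer sees the current input plus the previous hidden state.
            self.W_hidden = rng.normal(0.0, 0.1, (n_hidden, n_inputs + n_hidden))
            self.W_actor = rng.normal(0.0, 0.1, (n_actions, n_hidden))  # actor head
            self.W_critic = rng.normal(0.0, 0.1, (1, n_hidden))         # critic head
            self.context = np.zeros(n_hidden)

        def step(self, x):
            """Return (action probabilities, state value) for one observation x."""
            z = np.concatenate([x, self.context])
            h = np.tanh(self.W_hidden @ z)       # shared hidden layer
            self.context = h.copy()              # Elman copy-back: history memory
            logits = self.W_actor @ h
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                 # actor node: action distribution
            value = (self.W_critic @ h).item()   # critic node: estimated value
            return probs, value

        def reset(self):
            """Clear the context layer at the start of an episode."""
            self.context[:] = 0.0

On each step the context vector carries a compressed history of past inputs, which is what lets the network respond to a reward that depends on an earlier transition.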


Related articles

On using discretized Cohen-Grossberg node dynamics for model-free actor-critic neural learning in non-Markovian domains

We describe how multi-stage non-Markovian decision problems can be solved using actor-critic reinforcement learning by assuming that a discrete version of Cohen-Grossberg node dynamics describes the node-activation computations of a neural network (NN). Our NN (i.e., agent) is capable of rendering the process Markovian implicitly and automatically in a totally model-free fashion without learning...
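For reference, Cohen-Grossberg node dynamics have the general form below; a forward-Euler discretization of the kind the title refers to then yields a per-step activation update (the Euler scheme and step size \(\Delta t\) are illustrative assumptions, not necessarily the discretization used in the paper):

\[
\dot{x}_i = a_i(x_i)\Big[ b_i(x_i) - \sum_{j=1}^{n} c_{ij}\, d_j(x_j) \Big],
\qquad
x_i(t+1) = x_i(t) + \Delta t\; a_i\big(x_i(t)\big)\Big[ b_i\big(x_i(t)\big) - \sum_{j=1}^{n} c_{ij}\, d_j\big(x_j(t)\big) \Big],
\]

where \(a_i\) is a positive amplification function, \(b_i\) a self-signal term, \(c_{ij}\) the connection weights, and \(d_j\) the node output functions.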

Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronec...
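For a fully connected layer, K-FAC approximates the layer's Fisher-matrix block as a Kronecker product F ≈ A ⊗ G, where A is the second moment of the layer's inputs and G that of the gradients with respect to its pre-activations; the natural-gradient step then reduces to G^-1 (dL/dW) A^-1 without ever forming F. A minimal numpy sketch (the damping constant and plain batch averages are illustrative assumptions):

    import numpy as np

    def kfac_natural_grad(grad_W, acts, pre_grads, damping=1e-2):
        """Precondition a linear layer's gradient by the K-FAC Fisher.

        grad_W    : (out, in) Euclidean gradient w.r.t. the weights W
        acts      : (batch, in) layer inputs a, so A = E[a a^T]
        pre_grads : (batch, out) pre-activation gradients g, so G = E[g g^T]
        Returns G^-1 grad_W A^-1, the Kronecker-factored natural gradient.
        """
        n = acts.shape[0]
        A = acts.T @ acts / n + damping * np.eye(acts.shape[1])
        G = pre_grads.T @ pre_grads / n + damping * np.eye(pre_grads.shape[1])
        return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

A trust-region step as in the method above would additionally rescale this update so that the predicted KL divergence of the policy stays below a fixed bound.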

Learning to Play Donkey Kong Using Neural Networks and Reinforcement Learning

Neural networks and reinforcement learning have successfully been applied to various games, such as Ms. Pacman and Go. We combine multilayer perceptrons and a class of reinforcement learning algorithms known as actor-critic to learn to play the arcade classic Donkey Kong. Two neural networks are used in this study: the actor and the critic. The actor learns to select the best action given the g...

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods t...
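The "maximum entropy" in the title refers to the objective from the cited paper, which augments expected return with the policy's entropy at every visited state:

\[
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],
\]

where the temperature \(\alpha\) trades reward against entropy; the entropy bonus encourages broader exploration and is part of how the method addresses the brittle convergence noted above.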

Incremental Multi-Step Q-Learning

This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The λ parameter is used to distribute credit throughout sequences of actions, leading ...
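The combination sketched above — one-step Q-learning whose TD error is distributed backward over a sequence of actions via λ — is commonly implemented with eligibility traces. A minimal tabular sketch follows; the Watkins-style trace cutting on exploratory actions is one standard variant and an assumption here, not necessarily the paper's exact algorithm:

    import numpy as np

    def q_lambda_update(Q, E, s, a, r, s_next, took_greedy,
                        alpha=0.1, gamma=0.95, lam=0.8):
        """One tabular Q(lambda) update with eligibility traces.

        Q, E : (n_states, n_actions) value table and trace table.
        The one-step TD error is applied to every recently visited
        state-action pair in proportion to its decaying trace, which
        spreads credit through the whole action sequence.
        """
        delta = r + gamma * Q[s_next].max() - Q[s, a]  # one-step TD error
        E[s, a] += 1.0                                 # mark the visited pair
        Q += alpha * delta * E                         # update all traced pairs
        if took_greedy:
            E *= gamma * lam                           # decay traces
        else:
            E[:] = 0.0                                 # cut traces after exploration
        return Q, E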


Publication year: 1998